# Visual language understanding
Gemma 3 4b It Qat GGUF
Gemma 3 is a lightweight, advanced open model series from Google, built on the same research and technology used to create Gemini models. This model is multimodal, capable of processing both text and image inputs to generate text outputs.
Text-to-Image English
unsloth
2,629
2
VL Rethinker 7B Mlx 4bit
Apache-2.0
A 4-bit MLX quantization of the TIGER-Lab/VL-Rethinker-7B model, optimized for Apple devices and supporting visual question answering.
Text-to-Image English
TheCluster
14
0
Qwen Qwen2.5 VL 32B Instruct GGUF
Apache-2.0
Qwen2.5-VL-32B-Instruct is a 32B-parameter multimodal vision-language model that supports image understanding and text generation tasks.
Text-to-Image English
bartowski
2,782
1
Qwen2 Vl 7b Rslora Offensive Meme Singapore
MIT
A visual language model for classifying offensive memes in the Singapore context, fine-tuned from Qwen2-VL-7B-Instruct.
Multimodal Fusion
Transformers English

aliencaocao
1,684
0
Mulberry Qwen2vl 7b
Apache-2.0
Mulberry is a step-by-step reasoning model trained on the Mulberry-260K SFT dataset, which was generated through collective knowledge search.
Text-to-Image
Transformers

HuanjinYao
13.57k
1
Migician
Apache-2.0
Migician is the first multimodal large language model with free-form multi-image localization (grounding) capability. It achieves precise localization in complex multi-image scenarios and outperforms 70B-scale models.
Text-to-Image
Transformers English

Michael4933
83
1
Open LLaVA NeXT LLaMA3 8B
Apache-2.0
An open-source chatbot trained by full-model fine-tuning on open-source data. It can be used for research on multimodal models and chatbots.
Text-to-Image
Safetensors
Share4oReasoning
215
0
Qwen2 VL 7B Instruct GGUF
Apache-2.0
Qwen2-VL-7B-Instruct is a multimodal vision-language model that jointly understands images and text and generates text output.
Text-to-Image
Transformers English

tensorblock
124
0
Glm Edge V 5b
Other
GLM-Edge-V-5B is a 5-billion-parameter multimodal model that supports image and text inputs, capable of performing image understanding and text generation tasks.
Image-to-Text
THUDM
4,357
12
Featured Recommended AI Models